Regression

Fitting data.

Formula

$$ \begin{aligned} \underline{w}^* &= \arg\min_{\underline{w}} \mathcal{L}(\underline{w}) \\ &= \arg\min_{\underline{w}} \sum_{i=1}^{N} \ell(\underline{w}, \underline{x}^{(i)}; t^{(i)}) \\ &= \arg\min_{\underline{w}} \sum_{i=1}^{N} \left[ \Phi(\underline{x}^{(i)}) \cdot \underline{w} - t^{(i)} \right]^2 \end{aligned} $$

One way to deal with target functions that are not linear in the raw inputs is to augment the feature space (via the feature map $\Phi$) so that the target is linear in the augmented space.
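A minimal sketch of both ideas in Python/NumPy (the polynomial feature map `phi`, the toy data, and the degree are illustrative assumptions, not part of the original notes): the columns of `phi(x)` play the role of $\Phi(\underline{x})$, and `np.linalg.lstsq` returns the weights minimizing the squared-error objective above.

```python
import numpy as np

def phi(x, degree=3):
    """Map scalar inputs x to polynomial features [1, x, x^2, ..., x^degree]."""
    return np.stack([x ** d for d in range(degree + 1)], axis=1)

# Toy data (illustrative assumption): a nonlinear target plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = np.sin(np.pi * x) + 0.1 * rng.normal(size=50)

# Least squares in the augmented space: w = argmin_w ||Phi(x) w - t||^2.
Phi = phi(x)                                  # shape (N, degree + 1)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # solves the normal equations

print("fitted weights:", w)
```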

Problem: Overfit

With too many features relative to the amount of data, the model overfits.

Solution: Cross Validation and Regularization.

Cross Validation

Try different feature sizes and determine where overfitting starts. This requires a lot of data and is slow: if data is scarce, each training split may end up with fewer examples than features; if data is plentiful, each round takes a long time to train. It is also impractical when specific features are of interest in their own right.
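A rough k-fold sketch, reusing `phi`, `x`, and `t` from the sketch above (the fold count and degree range are arbitrary choices): validation error typically falls and then rises again as the feature size grows, and the turning point is where overfitting sets in.

```python
import numpy as np

def kfold_mse(x, t, degree, k=5):
    """Average held-out squared error of a degree-`degree` polynomial fit over k folds."""
    idx = np.arange(len(x))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)                     # indices not in this fold
        w, *_ = np.linalg.lstsq(phi(x[train], degree), t[train], rcond=None)
        errors.append(np.mean((phi(x[fold], degree) @ w - t[fold]) ** 2))
    return np.mean(errors)

# Sweep the feature size (polynomial degree) and watch where validation error turns up.
for degree in range(1, 10):
    print(degree, kfold_mse(x, t, degree))
```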

Regularization

Penalize complicated answers. This trades training performance against solution complexity. $$ \begin{aligned} \underline{w}^* &= \arg\min_{\underline{w}} \left[ \lambda \Vert\underline{w}\Vert_2^2 + \mathcal{L}(\underline{w}) \right] \\ &= \arg\min_{\underline{w}} \left( \lambda \Vert\underline{w}\Vert_2^2 + \sum_{i=1}^{N} \left[ \Phi(\underline{x}^{(i)}) \cdot \underline{w} - t^{(i)} \right]^2 \right) \end{aligned} $$
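A sketch of the closed-form ridge solution under these assumptions (the helper name `ridge_fit` and the value of $\lambda$ are mine; it continues the toy setup above): the penalty visibly shrinks the weights of an otherwise wild high-degree fit.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimize ||Phi w - t||^2 + lam ||w||^2 via the closed form (Phi^T Phi + lam I)^-1 Phi^T t."""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ t)

# Continuing the toy setup: a high-degree fit with and without the penalty.
Phi9 = phi(x, degree=9)
w_unreg = np.linalg.lstsq(Phi9, t, rcond=None)[0]
w_ridge = ridge_fit(Phi9, t, lam=0.1)
print("unregularized weight norm:", np.linalg.norm(w_unreg))
print("ridge weight norm:        ", np.linalg.norm(w_ridge))
```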

Note: replacing the $L_2$ penalty $\Vert\underline{w}\Vert_2^2$ with an $L_1$ penalty $\Vert\underline{w}\Vert_1$ drives many weights to exactly zero, i.e. it gives a sparse solution.

Other ways to get sparseness: Forward Selection (start with a small feature set and add features until performance stops improving under cross validation) and Backward Elimination (start with all features and remove them one at a time). Both methods are greedy and can be slow.
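A greedy forward-selection sketch (the helpers `cv_error` and `forward_selection` are illustrative, and it again reuses `phi`, `x`, `t` from above): each round adds the single column that most improves cross-validation error, stopping when no addition helps.

```python
import numpy as np

def cv_error(Phi, t, k=5):
    """k-fold cross-validation MSE of a plain least-squares fit on the given columns."""
    idx = np.arange(len(t))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        w, *_ = np.linalg.lstsq(Phi[train], t[train], rcond=None)
        errs.append(np.mean((Phi[fold] @ w - t[fold]) ** 2))
    return np.mean(errs)

def forward_selection(Phi, t):
    """Greedily add the feature (column) that most improves CV error; stop when none helps."""
    selected, remaining = [], list(range(Phi.shape[1]))
    best_err = np.inf
    while remaining:
        scores = {j: cv_error(Phi[:, selected + [j]], t) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_err:
            break
        best_err = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

print(forward_selection(phi(x, degree=9), t))
```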

Problem: Bias-Variance Tradeoff

Models with low bias usually have high variance. For instance, a squiggly curve that fits the training data well (low bias) may change drastically when refit on a different training set (high variance). The ideal model has both low bias and low variance.

Terms: bias is the error from overly rigid assumptions in the model (systematic underfitting); variance is the sensitivity of the fitted model to the particular training set drawn (overfitting to noise).

Situations:

When data are sparse, variance can be a problem: restricting the hypothesis space reduces variance at the cost of increased bias. When data are plentiful, variance is less of a concern.
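A small simulation of this point, under the same toy-data assumptions as above: refitting a rigid (degree 1) and a flexible (degree 9) model on many fresh training sets shows how much more the flexible model's prediction at a fixed point varies.

```python
import numpy as np

# Refit on many resampled training sets and measure the spread (variance) of the
# prediction at a fixed test point, for a rigid vs. a flexible model.
rng = np.random.default_rng(1)
x_test = 0.5
for degree in (1, 9):
    preds = []
    for _ in range(200):
        xs = rng.uniform(-1, 1, size=20)                   # a fresh, small training set
        ts = np.sin(np.pi * xs) + 0.1 * rng.normal(size=20)
        w, *_ = np.linalg.lstsq(phi(xs, degree), ts, rcond=None)
        preds.append(phi(np.array([x_test]), degree) @ w)
    preds = np.array(preds)
    print(f"degree {degree}: mean prediction {preds.mean():.3f}, variance {preds.var():.4f}")
```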

Bayesian View

Regularized least squares can also be derived as MAP estimation: assume Gaussian observation noise with variance $\sigma^2$ and a zero-mean Gaussian prior on the $k+1$ weights with precision $\alpha$; maximizing the posterior is then equivalent to minimizing the regularized squared error with $\lambda = \alpha \sigma^2$. $$ \begin{aligned} P[D | H] \cdot P[H] &= P[\underline{y} | f(X; \underline{w}) ] \cdot P[\underline{w}] \\ &= \prod_{i=1}^{N} \frac{e^{-\frac{[y^{(i)} - f(\underline{x}^{(i)})]^2}{2 \sigma^2}}}{\sqrt{2\pi \sigma^2}} \cdot P[\underline{w}] \\ &= \prod_{i=1}^{N} \frac{e^{-\frac{[y^{(i)} - f(\underline{x}^{(i)})]^2}{2 \sigma^2}}}{\sqrt{2\pi \sigma^2}} \cdot \frac{e^{-\frac{\alpha \underline{w}^T \underline{w}}{2}}}{\left(\frac{2\pi}{\alpha}\right)^{\frac{k+1}{2}}} \\ -\log\left( P[D | H] \cdot P[H] \right) &= \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left[ y^{(i)} - f(\underline{x}^{(i)}) \right]^2 + \frac{\alpha}{2} \Vert \underline{w} \Vert_2^2 + \text{const} \end{aligned} $$
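A quick numerical check of the correspondence (the values of $\sigma^2$ and $\alpha$ are arbitrary, and it reuses `phi`, `x`, `t` from the earlier sketches): the MAP weights coincide with the ridge weights once $\lambda = \alpha \sigma^2$.

```python
import numpy as np

# MAP under Gaussian noise (variance sigma^2) and a Gaussian prior (precision alpha)
# solves (Phi^T Phi / sigma^2 + alpha I) w = Phi^T t / sigma^2, which matches ridge
# regression with lam = alpha * sigma^2.
sigma2, alpha = 0.1 ** 2, 10.0
lam = alpha * sigma2

Phi3 = phi(x, degree=3)
w_map = np.linalg.solve(Phi3.T @ Phi3 / sigma2 + alpha * np.eye(Phi3.shape[1]),
                        Phi3.T @ t / sigma2)
w_ridge = np.linalg.solve(Phi3.T @ Phi3 + lam * np.eye(Phi3.shape[1]), Phi3.T @ t)
print(np.allclose(w_map, w_ridge))   # True: the two solutions coincide
```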

by Jon